Clustering

Olga Dethlefsen, Mun-Gwan Hong and Eva Freyhult

Introduction

  • Clustering is about grouping objects together according to similarity.
  • Objects are grouped so that objects within the same cluster are more similar to one another than to objects in other clusters.
  • Clustering is a type of unsupervised learning
  • Objects are clustered based on a set of \(p\) variables. The variables can be gene expression, protein concentration or other properties.
  • Clustering is commonly used for data exploration and to identify substructures in a data set.

Example: Clustering fruits

Partition methods vs. hierarchical

flowchart LR
  A([Type of splitting]) -.-> B([Partition])
  A -.-> C([Hierarchical])
  C -.-> D([Agglomerative])
  C -.-> E([Divisive])

Distance (or Dissimilarity)

All clustering algorithms require a measure of dissimilarity.

We will mainly discuss distances between quantitative variables, but distances can also be computed for categorical variables.

Distances for quantitative variables

Euclidean distance

  • Straight-line distances in Euclidean space.
  • In \(n\)-dimensional space, the Euclidean distance between two points \(A = (a_1, a_2, ..., a_n)\) and \(B = (b_1, b_2, ..., b_n)\) is

\[distance(A,B) = \sqrt{\sum_{i=1}^{n}(a_i - b_i)^2}\]

  • Might not be appropriate for high-dimensional data, due to the “distance concentration” phenomenon (Aggarwal, Hinneburg, and Keim (2001)): all pairwise distances between data points converge to the same value.
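The formula above can be evaluated directly, for instance with NumPy (a Python sketch with made-up point values; the course tooling is R):

```python
import numpy as np

# two points in 3-dimensional space (illustrative values)
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

# Euclidean distance: sqrt of the sum of squared coordinate differences
d = np.sqrt(np.sum((a - b) ** 2))
print(d)  # 5.0, since sqrt(3^2 + 4^2 + 0^2) = 5
```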

Manhattan distance

  • The Manhattan or City Block metric

\[distance(A,B) = \sum_{i=1}^{n}|a_i - b_i|\]

  • Preferred over Euclidean distance for high-dimensional data

Minkowski distance

\[distance(A,B) = \left(\sum_{i=1}^{n}|a_i - b_i|^{p}\right)^{1/p}\]

where Euclidean (p=2) and Manhattan (p=1) are the most common cases.
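A small sketch of the Minkowski family, checking that \(p=1\) recovers the Manhattan distance and \(p=2\) the Euclidean distance (Python for illustration; the helper name `minkowski` is ours):

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance: (sum |a_i - b_i|^p)^(1/p)."""
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

print(minkowski(a, b, 1))  # 7.0 -> Manhattan: |1-4| + |2-6| + |3-3|
print(minkowski(a, b, 2))  # 5.0 -> Euclidean: sqrt(9 + 16 + 0)
```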

Pearson’s correlation coefficient

\[r = \frac{\sum(a_i-\bar a)(b_i-\bar b)}{\sqrt{\sum(a_i-\bar a)^2\sum(b_i-\bar b)^2}}\]

The correlation coefficient is a similarity measure, but can be transformed to a distance.

\[distance(A,B) = \sqrt{1-r}\]

  • Smaller when profiles are similar.
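A sketch of the transformation (Python for illustration; the function name is ours). Similar profiles have \(r\) near 1 and therefore a distance near 0; perfectly anti-correlated profiles reach the maximum, \(\sqrt{2}\):

```python
import numpy as np

def correlation_distance(a, b):
    """Turn Pearson's r into a distance: sqrt(1 - r)."""
    r = np.corrcoef(a, b)[0, 1]
    return np.sqrt(1.0 - r)

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([4.0, 3.0, 2.0, 1.0])  # perfectly anti-correlated profile, r = -1
print(correlation_distance(a, b))   # sqrt(2), approximately 1.414
```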

k-means

  • A partitioning method, aims to divide objects into \(k\) discrete classes


  • Produces exactly \(k\) discrete clusters; \(k\) must be given to the algorithm.
  • The k-means algorithm minimizes the variance within clusters by iteratively assigning each object to the cluster with the closest centroid (mean).
  • A cluster’s centroid is the arithmetic mean of all \(n_k\) objects in the cluster.

\[{\mathbf m}_k = \frac{1}{n_k} \sum_{i=1}^{n_k} {\mathbf x}_{i}\]

  • In R: the kmeans function.

The Lloyd and Forgy algorithms

  1. Initialization. Select \(k\) initial centroids. The initial centroids can be selected in several different ways. Two common methods are

    • Select \(k\) data points as initial centroids.
    • Randomly assign each data point to one out of \(k\) clusters and compute the centroids for these initial clusters.
  2. Assign each object to the closest centroid (in terms of squared Euclidean distance). The squared Euclidean distance between an object (a data point) and a cluster centroid \(m_k\) is \(d_i = \sum_{j=1}^p (x_{ij} - m_{kj})^2\). By assigning each object to the closest centroid, the total within-cluster sum of squares (WSS) is minimized. \[WSS = \sum_{k=1}^K\sum_{i \in C_k}\sum_{j=1}^p (x_{ij} - m_{kj})^2\]

  3. Update the centroids for each of the \(k\) clusters by computing the centroid for all the objects in each of the clusters.

  4. Repeat steps 2 and 3 until convergence.
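The steps above can be sketched as a toy implementation, assuming NumPy and using the first initialization method (picking \(k\) data points); the function name `kmeans_lloyd` is ours:

```python
import numpy as np

def kmeans_lloyd(X, k, n_iter=100, seed=0):
    """Minimal sketch of the Lloyd/Forgy iteration described above."""
    rng = np.random.default_rng(seed)
    # step 1: select k distinct data points as initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # step 2: assign each object to the closest centroid
        # (squared Euclidean distance to every centroid)
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # step 3: recompute each centroid as the mean of its objects
        new_centroids = np.array([X[labels == j].mean(axis=0)
                                  if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        # step 4: stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# two well-separated groups on a line
X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels, centroids = kmeans_lloyd(X, k=2)
# the pairs {0, 1} and {10, 11} end up in separate clusters,
# with centroids 0.5 and 10.5
```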

Interactive visualization.

K-means

(A sequence of figures stepping through the k-means iterations; plots omitted.)

Pros and cons of k-means clustering

  • Relatively fast and scalable
  • Easy to interpret the results due to the simple assignment
  • Works well when the expected number of clusters is known.
  • Sensitive to the initial placement of cluster centroids.
  • Outliers can also significantly impact the position of centroids and lead to poor cluster assignments.

Hierarchical clustering

  • Two main approaches: agglomerative (bottom-up) and divisive (top-down).

Agglomerative

Divisive

Linkage & Agglomerative clustering

Linkage method - dissimilarity between clusters

Ward’s linkage

  • Minimizes the total within-cluster variance.
  • Merges the pair of clusters that gives the minimum increase in within-cluster sum of squares.
  • Requires Euclidean or squared Euclidean distance.

Dendrograms of various linkages

Getting clusters

To obtain concrete clusters from a hierarchical clustering we can cut the tree either at a chosen height or into a chosen number of clusters.
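Both ways of cutting can be sketched with SciPy's hierarchical clustering routines (Python for illustration, with made-up data; in R the analogous calls are hclust and cutree):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# two tight, well-separated groups of points in 2-D (illustrative data)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (5, 2)),
               rng.normal(3, 0.2, (5, 2))])

# agglomerative clustering with Ward's linkage
Z = linkage(X, method="ward")

# cut the tree into a fixed number of clusters ...
by_k = fcluster(Z, t=2, criterion="maxclust")
# ... or at a fixed height (merge distance)
by_height = fcluster(Z, t=3.0, criterion="distance")
```

For this data both cuts recover the two groups, since all within-group merges happen far below the height at which the two groups are joined.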

How many clusters?

Elbow method

  • Plot the total within-cluster sum of squares, \(WSS\) (the quantity that the k-means algorithm tries to minimize), against the number of clusters \(k\).
  • The inflection (bend, elbow) on the curve indicates an optimal number of clusters.
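A sketch of how the elbow curve can be computed, here using SciPy's kmeans2 for the clustering step (Python for illustration; data and helper name `wss` are ours). In practice one would plot `curve` against \(k\) and look for the bend:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# three well-separated groups, so the "true" k is 3
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.2, (20, 2)) for c in (0, 4, 8)])

def wss(X, labels, centroids):
    """Total within-cluster sum of squares."""
    return sum(((X[labels == j] - c) ** 2).sum()
               for j, c in enumerate(centroids))

curve = []
for k in range(1, 7):
    centroids, labels = kmeans2(X, k, minit="++", seed=0)
    curve.append(wss(X, labels, centroids))
# the curve drops steeply up to k = 3 and flattens after: the elbow
```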

Gap statistic

The gap statistic measures the distance between the observed \(WSS\) and an expected \(WSS\) under a reference (null) model.

\[G(k) = E[\ln(WSS_{unif}(k))] - \ln(WSS(k))\]

Choose \(k\) as the smallest \(k\) such that \(G(k) \geq G(k+1) - s_{k+1}\).
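A sketch of estimating \(G(k)\), with the reference data drawn uniformly over the bounding box of the data (one common choice of null model; function names and data are ours, and kmeans2 stands in for the clustering step):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def ln_wss(X, k, seed=0):
    """log of the total within-cluster sum of squares after k-means."""
    centroids, labels = kmeans2(X, k, minit="++", seed=seed)
    return np.log(sum(((X[labels == j] - c) ** 2).sum()
                      for j, c in enumerate(centroids)))

def gap(X, k, n_ref=10, seed=0):
    """G(k): mean log-WSS over uniform reference sets minus log-WSS of X."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    ref = [ln_wss(rng.uniform(lo, hi, X.shape), k, seed=s)
           for s in range(n_ref)]
    return np.mean(ref) - ln_wss(X, k)

# three well-separated groups: the gap is much larger at k = 3 than at k = 1
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.2, (20, 2)) for c in (0, 4, 8)])
```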

Silhouette method

The silhouette value for a single object \(i\) is a value between -1 and 1 that measures how similar the object is to other objects in the same cluster, compared to how similar it is to objects in other clusters.

The average silhouette over all objects is a measure of how good the clustering is: the higher the value, the better the clustering.

Silhouette method

  • for an object \(i\) in cluster \(C_a\):

    1. Average distance in the same cluster \(C_a\) \[a(i) = \frac{1}{|C_a|-1} \sum_{j \neq i, j \in C_a} d(i, j)\]
    2. Average distance between \(i\) and all objects in another cluster \(C_b\), denoted \(d(i,C_b)\); take the minimum over all other clusters: \[b(i) = \min_{b \neq a} d(i, C_b)\]
    3. The silhouette index \[s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}\]
  • \(s(i)\) well clustered \(\approx 1\), lies between clusters \(\approx 0\), incorrectly clustered \(\approx -1\).
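The three steps above translate directly into code (a Python sketch with made-up data; the function name `silhouette` is ours, and singleton clusters are given \(s(i)=0\) by convention):

```python
import numpy as np

def silhouette(X, labels):
    """Silhouette value s(i) for every object, following the definition above."""
    n = len(X)
    # pairwise Euclidean distances between all objects
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    s = np.zeros(n)
    for i in range(n):
        same = (labels == labels[i])
        same[i] = False
        if not same.any():        # singleton cluster: s(i) is defined as 0
            continue
        a = d[i, same].mean()                      # step 1: a(i)
        b = min(d[i, labels == c].mean()           # step 2: b(i)
                for c in set(labels) if c != labels[i])
        s[i] = (b - a) / max(a, b)                 # step 3: s(i)
    return s

# two tight, well-separated clusters -> average silhouette close to 1
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
labels = np.array([0, 0, 0, 1, 1, 1])
print(silhouette(X, labels).mean())
```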

pvclust

  • The pvclust R package assesses the uncertainty in hierarchical cluster analysis (Suzuki and Shimodaira 2006).
  • pvclust calculates probability values (p-values) for each cluster using bootstrap resampling.

NbClust

  • An R package that provides 30 indices for determining the optimal number of clusters

References

Aggarwal, Charu C., Alexander Hinneburg, and Daniel A. Keim. 2001. “On the Surprising Behavior of Distance Metrics in High Dimensional Space.” In Database Theory – ICDT 2001, edited by Gerhard Goos, Juris Hartmanis, Jan Van Leeuwen, Jan Van Den Bussche, and Victor Vianu, 1973:420–34. Berlin, Heidelberg: Springer Berlin Heidelberg. https://doi.org/10.1007/3-540-44503-X_27.
Suzuki, Ryota, and Hidetoshi Shimodaira. 2006. “Pvclust: An R Package for Assessing the Uncertainty in Hierarchical Clustering.” Bioinformatics 22. https://doi.org/10.1093/bioinformatics/btl117.